
Insurance fraud is a deliberately false or misrepresented claim made by an insured, claimant, or other entity for financial gain. It is one of the largest and most well-known problems that insurers face, and fraudulent claims can be highly expensive for an insurer, so it is important to distinguish legitimate claims from fraudulent ones. It is not feasible for insurance companies to investigate every claim manually, since that would cost too much time and money. Fraud can be committed at different touchpoints in the insurance lifecycle by insured applicants, policyholders, third-party claimants, or professionals such as insurance agencies and agents who provide such services.
The goal of this project is to build a model that can detect auto insurance fraud.
The largest asset insurers have in the fight against fraud is data. The raw data has 40 variables, including the target 'fraud_reported'. The variables include policy number, policy bind date, policy annual premium, incident severity, incident location, and auto model.
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score, classification_report, confusion_matrix
import warnings
warnings.simplefilter("ignore")
# Reading data
df = pd.read_csv('insurance_claims.csv')
pd.set_option('display.max_columns', None)
# First five rows of data
df.head()
# Last five rows of data
df.tail()
# Shape of data
print("The dataset has {} rows and {} columns. \n" .format(df.shape[0], df.shape[1]))
# Datatypes
df.dtypes
# Non null values, data type
df.info()
# Statistical description of the numerical variables in our dataset
df.describe()
# Check for missing values
df.isnull().sum()
# variable _c39 has 1000 missing values and has to be dropped
df.drop('_c39', inplace=True, axis=1)
# Unique value count of each categorical feature
df1 = df.select_dtypes(include=[object])
print(df1.nunique())
Data visualization provides an organized, pictorial representation of the data, which makes it easier to understand, observe, and analyze. It lets us observe and interpret the patterns and trends that exist in our data.
# histogram of Fraud_reported
px.histogram(df,x="fraud_reported",color='fraud_reported',title="Fraud_reported count")
There are more non-fraudulent claims than fraudulent ones
# Pie chart of fraud reported
label_cnts = df.fraud_reported.value_counts()
# Plot value_counts
px.pie(names = ["No Fraud Reported", "Fraud Reported"],values = label_cnts.values,title="Fraud reported",height=700,width=700)
75.3% of claims are non-fraudulent, while 24.7% are fraudulent
# Histogram of incident type
px.histogram(df,"incident_type", color= "fraud_reported", title="Incident type")
People with incident type Single Vehicle Collision and Multi-vehicle Collision filed the most claims
# Histogram of incident type accorded to fraud reported
df2 = df[df['fraud_reported']=='Y']
px.histogram(df2,"incident_type",color="fraud_reported",title="Incident type according to fraud reported")
# Pie chart of incident type with frauds reported
incident_type_count = df2.incident_type.value_counts()
# Plot value_counts
px.pie(names = incident_type_count.index,values = incident_type_count.values,title="incident type according to fraud reported",height=700,width=700)
Single Vehicle Collision and Multi-vehicle Collision have more fraudulent claims than any other incident type
# Histogram of insured sex
px.histogram(df,"insured_sex",color="fraud_reported",title="Insured sex ")
# Pie chart of insured sex according to fraud reported
insured_sex_count = df2.insured_sex.value_counts()
# Plot value_counts
px.pie(names = insured_sex_count.index,values = insured_sex_count.values,title="Sex type per fraudulent claims",height=700,width=700)
Surprisingly, women committed more fraud than men
# Histogram of educational level
px.histogram(df, x = "insured_education_level",color="fraud_reported",title="education level")
# Pie chart of insured education level according to fraud reported
insured_education_level_count = df2.insured_education_level.value_counts()
# Plot value_counts
px.pie(names = insured_education_level_count.index,values = insured_education_level_count.values,title="insured education level according to fraudulent reported",height=700,width=700)
Those with a Juris Doctor (JD) education level have the most reported fraudulent claims
# Incident severity
px.histogram(df, x = "incident_severity",color="fraud_reported",title="incident severity")
Those with Minor Damage filed more claims, while those with Major Damage had more fraudulent claims
# Pie chart of incident severity according to fraud reported
incident_severity_count = df2.incident_severity.value_counts()
# Plot value_counts
px.pie(names = incident_severity_count.index,values = incident_severity_count.values,title="incident severity according to fraud",height=700,width=700)
People with major damage incidents filed more fraudulent claims
# Histogram of insured occupation
px.histogram(df,"insured_occupation",color="fraud_reported",title="Insured occupation")
# Pie chart of insured occupation according to fraud reported
insured_occupation_count = df2.insured_occupation.value_counts()
# Plot value_counts
px.pie(names = insured_occupation_count.index,values = insured_occupation_count.values,title="insured occupation according to fraud",height=700,width=700)
Claimants who hold exec-managerial positions filed the most fraudulent claims
# Histogram of authorities contacted
px.histogram(df,"authorities_contacted",color="fraud_reported",title="Authorities contacted")
# Pie chart of authorities contacted according to fraud reported
authorities_contacted_count = df2.authorities_contacted.value_counts()
# Plot value_counts
px.pie(names = authorities_contacted_count.index,values = authorities_contacted_count.values,title="Authorities contacted according to fraud",height=700,width=700)
Claims where "Other" authorities were contacted have the most fraud
# Boxplot of age
px.box(df,y="age",title="Age")
# boxplot of age according to fraud_reported
px.box(df,y="age", color="fraud_reported",title="Age per fraud")
# Boxplot of total claim amount
px.box(df,y="total_claim_amount",title="Total claim amount")
# Boxplot of total claim amount according to fraud_reported
px.box(df,y="total_claim_amount",color='fraud_reported', title="Total claim amount according to fraud")
# Box plot of months as customer
px.box(df,y="months_as_customer",title="months_as_customer")
# Boxplot of months_as_customer according to fraud_reported
px.box(df,y="months_as_customer", color="fraud_reported",title="months_as_customer")
# Boxplot of witnesses according to fraud_reported
px.box(df,y="witnesses", color="fraud_reported",title="witnesses")
Some variables have a few outliers, as seen in the boxplots. Outliers are data points that lie far from other, similar points. Ways to handle them include imputation (for example, capping extreme values) or training ensemble models such as random forests and gradient boosting, which are robust to outliers.
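As an illustration of flagging outliers, a common rule uses the interquartile range (IQR): points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] are flagged, and one treatment is to cap them at the fences. A minimal sketch on hypothetical numbers (not this dataset's values):

```python
import pandas as pd

# Hypothetical numeric column with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 95])

# IQR fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [95]

# One treatment option: cap (winsorize) values at the fences
s_capped = s.clip(lower, upper)
print(s_capped.max())  # 15.0
```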
# Correlation plot
corr= df.corr()
#heat map of correlation
plt.figure(figsize=(15,15))
sns.heatmap(corr, annot=True)
plt.title('Correlation Heatmap', fontdict={'fontsize':24}, pad=12)
plt.show()
df.sample(10)
The data contains an unwanted placeholder character '?' which needs to be treated.
# Dropping unnecessary variables
df = df.drop(columns = ['policy_number', 'policy_bind_date', 'policy_csl','insured_zip', 'incident_date','incident_location', 'policy_state', 'incident_city', 'insured_relationship', 'auto_make', 'auto_model', 'auto_year'])
It is good practice to split the data before preprocessing to avoid data leakage.
# Splitting data into train and test set
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)
df_train = df_train.replace('?', np.nan)
df_test = df_test.replace('?', np.nan)
df_train.isnull().sum()
# Replacing missing values with a new class 'Missing'
for col in df_train:
    df_train[col] = df_train[col].fillna('Missing')
for col in df_test:
    df_test[col] = df_test[col].fillna('Missing')
df_train.head(5)
# Separating independent variables from the target variable
y_train= df_train.pop('fraud_reported')
X_train = df_train
y_test = df_test.pop('fraud_reported')
X_test = df_test
# Encoding categorical variables with one-hot encoding
X_train = pd.get_dummies(X_train) # Encoding the train set
X_test = pd.get_dummies(X_test) # Encoding the test set
# Align test columns with train columns, in case some categories appear in only one split
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
X_train.head()
# encoding the target with label encoding
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y_train = labelencoder.fit_transform(y_train)
y_test = labelencoder.transform(y_test)
X_train.shape, X_test.shape # Shape of train and test set
# Scaling train and test set with MinMaxScaler
# MinMaxScaler transforms the variables into the range [0, 1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns = X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns = X_test.columns)
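As a quick sanity check on toy data (not the project's features), min-max scaling maps each column with x′ = (x − min) / (max − min), so every training column spans exactly [0, 1]:

```python
import numpy as np

# Toy matrix standing in for the real feature matrix
X = np.array([[1.0, 200.0],
              [3.0, 400.0],
              [5.0, 300.0]])

# Min-max scaling per column: x' = (x - min) / (max - min)
col_min, col_max = X.min(axis=0), X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled[:, 0])  # [0.  0.5 1. ]
print(X_scaled.min(), X_scaled.max())  # 0.0 1.0
```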
The dataset is imbalanced; SMOTE will help to balance the classes.
# Treating imbalance data in training dataset
from imblearn.over_sampling import SMOTE
from collections import Counter
counter = Counter(y_train)
print('Before SMOTE: ', counter)
smt = SMOTE()
X_train, y_train = smt.fit_resample(X_train, y_train)
counter = Counter(y_train)
print('After SMOTE: ', counter)
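For intuition, SMOTE does not simply duplicate minority rows; it synthesizes new ones by interpolating between a minority sample and one of its nearest minority neighbours: x_new = x_i + λ(x_neighbour − x_i) with random λ ∈ [0, 1]. A minimal numpy sketch of that core step (a simplification, not imblearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class feature vectors
x_i = np.array([1.0, 2.0])
x_nn = np.array([3.0, 4.0])

# Core SMOTE step: sample a point on the segment between them
lam = rng.random()
x_new = x_i + lam * (x_nn - x_i)

# The synthetic point lies between the two real samples, feature-wise
print(np.minimum(x_i, x_nn) <= x_new)  # [ True  True]
print(x_new <= np.maximum(x_i, x_nn))  # [ True  True]
```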
I will apply the following supervised learning models.
(1): Logistic Regression
(2): Decision Tree
(3): Random Forest
(4): Gradient Boosting
index = ["LogisticRegression", "DecisionTreeClassifier", "GradientBoostingClassifier", "RandomForestClassifier"]
results = pd.DataFrame(columns=['Accuracy','Precison','Recall', 'f1_score', 'AUC'],index=index)
Logistic regression is a linear model for classification. It is simple to implement, efficient and fast.
# set tuning paramters
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=0.1, penalty='l1', solver='liblinear')
# Fitting model to train set
LR.fit( X_train, y_train)
# Checking for overfitting and underfitting
print(" Accuracy on training set: ", LR.score( X_train, y_train))
print(" Accuracy on test set: ", LR.score( X_test, y_test))
# Prediction on test set
y_pred = LR.predict(X_test)
print(y_pred)
Model evaluation is the process of assessing a trained model on the test data set.
Confusion Matrix
A confusion matrix is a table used to evaluate the performance of a classification model.
It contains the following:
• True Positive (TP): Correct positive prediction
• False Positive (FP): Incorrect positive prediction
• True Negative (TN): Correct negative prediction
• False Negative (FN): Incorrect negative prediction
Recall or sensitivity: The number of true positive predictions out of the total number of actual positives.
Recall = TP / (TP + FN)
Precision: The number of true positive predictions divided by the total number of positive predictions.
Precision = TP / (TP + FP)
Accuracy: The total number of correct predictions divided by the total number of predictions.
Accuracy = (TP + TN) / (TP + TN + FN + FP)
f1_score: The harmonic mean of recall and precision.
f1_score = (2 × Precision × Recall) / (Precision + Recall)
Area Under Curve (AUC)
It represents the area under the ROC curve. An ROC curve plots the false positive rate on the x-axis against the true positive rate on the y-axis.
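The formulas above can be verified by hand against sklearn on a small example (the labels below are hypothetical, not from this dataset):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = fraud, 0 = no fraud
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat  = [1, 1, 0, 1, 0, 0, 0, 0]

# Count the four confusion-matrix cells directly
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_hat))  # 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_hat))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_hat))  # 1
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_hat))  # 4

# The formulas agree with sklearn's implementations
assert abs(recall_score(y_true, y_hat) - tp / (tp + fn)) < 1e-12
assert abs(precision_score(y_true, y_hat) - tp / (tp + fp)) < 1e-12
assert abs(accuracy_score(y_true, y_hat) - (tp + tn) / len(y_true)) < 1e-12
```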
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
cm
# Heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# Accuracy of test
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score
rec = recall_score(y_test, y_pred)
pre = precision_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
f1_sc = f1_score(y_test, y_pred)
print("Accuracy :: ",acc)
print("Precision :: ",pre)
print("Recall :: ", rec)
print("f1_score", f1_sc)
# ROC Curve, AUC
from sklearn import metrics
y_pred_proba = LR.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
# Classification report
print(classification_report(y_test, y_pred))
results.loc["LogisticRegression"] = [acc,pre,rec, f1_sc, auc]
Decision trees are a widely used model for classification and regression problems. A decision tree is made up of the decision node and leaf node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
# set tuning paramters
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
# Fitting model to train set
tree.fit( X_train, y_train)
# Checking for overfitting and underfitting
print(" Accuracy on training set: ", tree.score( X_train, y_train))
print(" Accuracy on test set: ", tree.score( X_test, y_test))
# Prediction on test set
y_pred1 = tree.predict(X_test)
print(y_pred1)
rec1 = recall_score(y_test, y_pred1)
pre1 = precision_score(y_test, y_pred1)
acc1 = accuracy_score(y_test, y_pred1)
f1_sc1 = f1_score(y_test, y_pred1)
print("Accuracy :: ",acc1)
print("Precision :: ",pre1)
print("Recall :: ", rec1)
print("f1_score", f1_sc1)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred1)
cm
# Heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# ROC Curve, AUC
from sklearn import metrics
y_pred_proba = tree.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc1 = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc1)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc1))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
# Classification report
print(classification_report(y_test, y_pred1))
# Feature importance
plt.figure(figsize=(15, 5))
importances = tree.feature_importances_
feature_importance = pd.Series(importances, index = X_train.columns)
feature_importance.plot(kind='bar')
plt.title('Feature importance')
plt.show()
results.loc["DecisionTreeClassifier"] = [acc1,pre1,rec1, f1_sc1, auc1]
A random forest is essentially a collection of decision trees, where each tree is slightly different from the others. The idea behind random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results.
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(max_depth=2, n_estimators=5, random_state=0)
# Fitting model to train set
RF.fit( X_train, y_train)
# Checking for overfitting and underfitting
print(" Accuracy on training set: ", RF.score( X_train, y_train))
print(" Accuracy on test set: ", RF.score( X_test, y_test))
# Prediction on test set
y_pred3 = RF.predict(X_test)
print(y_pred3)
rec2 = recall_score(y_test, y_pred3)
pre2 = precision_score(y_test, y_pred3)
acc2 = accuracy_score(y_test, y_pred3)
f1_sc2 = f1_score(y_test, y_pred3)
print("Accuracy :: ",acc2)
print("Precision :: ",pre2)
print("Recall :: ", rec2)
print("f1_score", f1_sc2)
# Confusion matrix
cm = confusion_matrix(y_test, y_pred3)
cm
# Heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# ROC Curve, AUC
from sklearn import metrics
y_pred_proba = RF.predict_proba(X_test)[:,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc2 = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc2)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc2))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
# Classification report
print(classification_report(y_test, y_pred3))
# Feature importance
plt.figure(figsize=(15, 5))
importances = RF.feature_importances_
feature_importance = pd.Series(importances, index = X_train.columns)
feature_importance.plot(kind='bar')
plt.title('Feature importance')
plt.show()
results.loc["RandomForestClassifier"] = [acc2,pre2,rec2, f1_sc2, auc2]
Gradient boosting machines are an ensemble method that combines multiple decision trees to create a more powerful model. Despite the "regression" in the underlying name (gradient boosted regression trees), these models can be used for both regression and classification.
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(max_depth=2, learning_rate=0.01, random_state=0)
# Fitting model to train set
gbrt.fit( X_train, y_train)
# Checking for overfitting and underfitting
print(" Accuracy on training set: ", gbrt.score( X_train, y_train))
print(" Accuracy on test set: ", gbrt.score( X_test, y_test))
# Prediction on test set
y_pred4 = gbrt.predict(X_test)
print(y_pred4)
rec3 = recall_score(y_test, y_pred4)
pre3 = precision_score(y_test, y_pred4)
acc3 = accuracy_score(y_test, y_pred4)
f1_sc3 = f1_score(y_test, y_pred4)
print("Accuracy :: ",acc)
print("Precision :: ",pre)
print("Recall :: ", rec)
print("f1_score", f1_sc)
cm = confusion_matrix(y_test, y_pred4)
cm
# Heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(cm, cmap= 'Blues', linecolor='black', fmt='', annot=True)
plt.title('confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
# ROC Curve, AUC
from sklearn import metrics
y_pred_proba = gbrt.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc3 = metrics.roc_auc_score(y_test, y_pred_proba)
print('auc: ', auc3)
#create ROC curve
plt.plot(fpr,tpr,label="AUC="+str(auc3))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC')
plt.legend(loc=4)
plt.show()
# Classification report
print(classification_report(y_test, y_pred4))
results.loc["GradientBoostingClassifier"] = [acc3,pre3,rec3, f1_sc3, auc3]
# Feature importance
plt.figure(figsize=(15, 5))
importances = gbrt.feature_importances_
feature_importance = pd.Series(importances, index = X_train.columns)
feature_importance.plot(kind='bar')
plt.title('Feature importance')
plt.show()
results = results*100
results
px.bar(results,y ="Accuracy",x = results.index,color = results.index,title="Accuracy Comparison")
px.bar(results,y ="Precison",x = results.index,color = results.index,title="Precision Comparison")
px.bar(results,y ="Recall",x = results.index,color = results.index,title="Recall Comparison")
px.bar(results,y ="f1_score",x = results.index,color = results.index,title="f1_score comparison")
Logistic Regression
Pros
• Simple algorithm that is easy to implement, does not require high computation power.
• Performs extremely well when the data is linearly separable.
• Less prone to over-fitting, with low-dimensional data.
Cons
• Poor performance on non-linear data
• Poor performance with highly correlated features.
Decision Tree
Pros
• Do not need to scale and normalize data.
• Handles missing values very well.
• Less effort in regard to preprocessing.
Cons
• Very prone to overfitting.
• Sensitive to outliers and changes in the data.
Random Forest
Pros
• It reduces the risk of overfitting since it is an ensemble of decision trees: the forest aggregates the predictions of all the trees to produce the final outcome.
• Excellent handling of missing data.
• Good Performance on Imbalanced datasets. It can also handle errors in imbalanced data (one class is majority and other class is minority).
• It can handle huge amount of data with higher dimensionality of variables.
• Little impact of outliers
• Useful to extract feature importance (can be used for feature selection).
Cons
• Random forests do not tend to perform well on very high dimensional, sparse data, such as text data.
Gradient Boosting
Pros
• Less feature engineering required (no need for scaling or normalizing data; handles missing values well).
• Fast to interpret
• Outliers have minimal impact.
• Good model performance
• Less prone to overfitting
Cons
• Overfitting possible if parameters not tuned properly.